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Configuration-space matrix elements of Af-body potentials arise naturally and ubiquitously in the 
Ritz-Galerkin solution of many-body quantum problems. For the common specialization of local, 
finite-range potentials, we develop the Tensor HyperContraction (THC) method, which provides 
a quantized renormalization of the coordinate-space form of the iV-body potential, allowing for a 
highly separable tensor factorization of the configuration-space matrix elements. This representation 
allows for substantial computational savings in chemical, atomic, and nuclear physics simulations, 
particularly with respect to difficult "exchange-like" contractions. 

PACS numbers: 



The physics of many-body quantum systems is of- 
ten captured by local, finite-range iV-body potentials 
V(ri, . . . ,Tn). Key examples include the Coulomb po- 
tential of atomic and molecular physics, the Yukawa 
potential of particle physics, and the effective Gogny 
pseudo-potential encountered in nuclear structure and 
nuclear astrophysics [Tj. 

Given some real, finite, one-particle Ritz-Galerkin ba- 
sis set {ipi(r)}, the configuration-space representation of 
V is the spatial integral tensor [2], 

(i . . .n\V\i' . . .n') = J dri . . . J dr N 
ipi(r x ) . . .^„(r JV )V r (T'i,. . .,r N )i) V {n) . . .if) n ,(r N ). (1) 

The generation, manipulation, and storage of this tensor 
is a major hurdle in many-body quantum simulations. 

Tensor HyperContraction (THC) is a general technique 
developed to obtain computational savings through ten- 
sor decomposition of the configuration-space integral ten- 
sor ([I]). We first introduced THC in the context of elec- 
tronic structure theory, where we have used several vari- 
ants to provide approximate resolution of the ubiquitous 
electron repulsion integral tensor, which encapsulates the 
Coulomb potential between electrons [3H5]- 

In this letter, we show that an exact THC decomposi- 
tion (X-THC) is possible for specific choices of basis sets 
that are widely used, especially in nuclear and atomic 
physics. Aside from providing significant gains for prob- 
lems involving these bases, this finding also explains the 
unexpected accuracy of approximate variants of THC in 
more general basis sets. 

Below, we first demonstrate the key features of the 
X-THC representation through the pedagogical, but en- 
tirely representative example of a one-dimensional, two- 



body problem in Cartesian coordinates using Hermite 
functions. The _D-dimensional, TV-body generalization 
of X-THC is then presented. Finally, the approximate 
least-squares THC method for non-polynomial basis sets 
is briefly explained, and generalized to A-body poten- 
tials. 

X-THC Example - Consider a one-dimensional (D = 
1) problem in Cartesian coordinates, involving a finite 
basis of M + 1 Hermite functions {ipi(x)} (labeled from 
to M) with a local two-body (A — 2) potential V = 
V(x±, x 2 )- The potential matrix elements are, 

(ij\V\i'f) = JJd Xl dx 2 

ipi(xi)ipj(x 2 )V(xi, x 2 )ipi> {xi)ipf (x 2 ). (2) 

The first stage in X-THC is to note that all (M +1) 2 prod- 
ucts tjji(xi)ijji'(xi) are exactly spanned by an orthonor- 
mal "auxiliary" basis {xa(%i)} consisting of 2M + 1 
Hermite functions with a slightly modified spatial range, 
Xa{x\) = iPa(V2%i) US- Using the generalized Einstein 
summation convention here and throughout, 

ipi(xi)ipi'(xi) = [H'A]xa(xi), (3) 

where, 

[ii'A] = / dxi ^i(xi)tl)i>(xi)xA{xi)- (4) 

JR 

This step is analogous to the popular Density Fit- 
ting (DF) procedure of electronic structure theory [7HS], 
though here it is exact due to the closure properties of 
the polynomial-based Hermite functions. The integrals 
are now given as, 

(ij\V\i'f) = [ii'A] [jj'B] G AB , (5) 
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where, 



G 



AB 



dxi dx 2 Xa(xi)xb(x 2 )V(x 1i x 2 ). (6) 



Thus, the fourth-order integral tensor is expressed as a 
product of second- and third-order tensors. Even though 
we have compressed the fourth-order tensor, this repre- 
sentation still precludes scaling reduction in "exchange- 
like" terms. A canonical example of such a term is the 
pairing field in the Hartree-Fock-Bogoliubov theory. This 
is analogous to the exchange term of Hartree-Fock, but 
easier to generalize for iV-body operators, 



A tJ = (ij\V\i'j'} Ki/f - [u'A][j?B]G ab k 



(7) 



where n is the pairing tensor. Despite the factorization, 
computing this term still scales as 0{M A ) = 0{M 2ND ). 

The critical step in THC is to resolve the three-index 
overlap integral [ii'A] to "unpin" the indices i and i' 
across some additional linear-scaling index P. That 
is, we seek a PARAFAC-like decomposition of the form 
[ii'A] = XfXfY£, 10J where the range of P is 0(M). 
Thanks to the choice of a polynomial basis, the overlap 
integral is exactly integrated by a 2M + 1-node Gaus- 
sian quadrature (in this case, Gauss-Hermite) defined by 
the nodes and weights {< xp,wp >} [TT]. Therefore, 
the quadrature grid index provides a natural PARAFAC 
decomposition of the overlap integral, 



[ii'A] = wptpi(xp)ipif(x P )xA(xp). 



(8) 



This is reminiscent of the discrete variable representa- 
tion [T2TH1] or pseudospectral [T5] techniques of chemical 
physics. Defining X p = ipi(xp) and Y p = wpxa(xp), 
we can form the intermediate object, 



z p Q = y p g ab y9 . 



(9) 



The full integral ^ is thus expressed as, 

(ij\v\i'j') = xfx?z p Qx p xf,. (10) 

This X-THC representation of the integral tensor is the 
key for the exact 0(M 3 ) = 0(M ND+1 ) treatment of 
the pairing term, via several intermediate summations, 
indicated here by brackets for clarity, 



— Xf Xj Z p ® ' Xf, ' X-, Ki'ji 
~ Z PQ 



= X 



X* 



Xj, Ki'j' 



(11) 



Interpretation - At first glance, the Z operator is a 
mere mathematical intermediate, but there exists a much 
richer interpretation: it is a quantized renormalization 
of the coordinate-space representation of the potential 
operator V. To see this, we first consider the continuous, 
renormalized potential operator V, defined as, 



This operator is not equivalent to the original in physical 
space, i.e., V{x\,x 2 ) 7^ V(xi,x 2 ), yet the matrix ele- 
ments of both operators are identical, i.e., (ij\V\i'j') — 
(ij\V\i'j'}. The renormalized operator is simply the raw 
operator V with all components outside of the finite prod- 
uct space {ipi{xi)ijji'(xi)} <^ {\a(xi)} projected out in 
each coordinate. This projection is serendipitous: the 
coordinate-space integrand involving V and the products 
of basis functions are exactly resolved by the Gaussian 
quadrature for the auxiliary basis, while the correspond- 
ing integrand for V is not exact under any finite quadra- 
ture due to the presence of "alias" components outside 
of {ipi(xi)ipi> (xi)}. Applying the Gaussian quadrature, 
we can quantize the renormalized operator V to produce 
the discrete operator V, adding quadrature weights to 
account for the spatial contribution of each point, 

V r (xi,x 2 ) = w p wq8{x 1 - x P )8(x 2 - x Q )V(xi,x 2 ). (13) 

As with V, the matrix elements of V are identical to 
those of V. Integrating V instead of V naturally exposes 
the X-THC factorization, 

(ij\V\i'f) = (ij\V\i'f) 

dxi dx 2 i'i{x 1 )^j(x 2 )V{xi,x 2 )^i l {xi)ipj l {x 2 ) 

= xfxfz p Qxfxf,. (14) 

Here, the elements Z p ® are simply the quantized values 
of the renormalized potential, with the weights rolled in, 
i.e., Z p< °* — wpwqV(xp,xq). An example involving a 
Gaussian potential in Hermite functions is shown in Fig- 
ure [T] The renormalized potential (right) clearly shows 
the effects of projection from the raw potential (left). 
The locations of the quantization to Z PC! (the positions 
at which V can be discretized in a lossless manner) are 
indicated with small white x's on the right. 
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V{xi,x 2 ) = xa(xi)xb{x 2 )G 



AB 



(12) 



FIG. 1: (color online) Example of the X-THC process for 
a one-dimensional, two-body Gaussian potential V(xi,X2) = 
exp(— x\ 2 ) in Hermite functions {ipi(x)} up to M — 5. Left 
panel: raw^ii,^)- Right panel: renormalized, quantizable 
V(xi,x 2 ). 



Generalized X-THC - The generalization of the one- 
dimensional, two-body, Hermite function example above 
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to iV-body potentials in D-dimensions and other choices 
of polynomial direct-product bases is quite straightfor- 
ward. 

For X-THC to hold, the one-particle basis must be of 
the D-dimensional direct-product polynomial type, i.e., 
^i( r ) = I\^=i F i„( r ^) v ^( r ^)- In each dimension n, P if[ 
is a polynomial of up to degree and w M is an arbi- 
trary weight function (analogous to the Gaussian term 
in the Hermite functions above). Such basis sets are 
widely used in atomic and nuclear many-body physics in 
Cartesian and other coordinate systems. Use of a direct- 
product polynomial basis automatically guarantees clo- 
sure: for the M M + 1 functions in the /nth dimension, the 
span < ipi^r^ijJi' (r^) > lies wholly inside a 2M^ + 1- 
function auxiliary basis, defined by a set of polynomials 



orthogonal with respect to the weight ^(ry, 



Addi- 



tionally, all quadratic products of auxiliary functions are 
exactly integrated by a 2M^ + 1-node Gaussian quadra- 
ture {< Tp ,Wp >} which can always be found, e.g., 
using the Golub-Welsch algorithm [15] . 

These properties allow for the X-THC factorization, 



(i...n\V\i' ...n') = X{ 



W rrP...W 



.XTZ 



X. 



.X 



(15) 

with each Xf being the direct product of the D under- 
lying xf*. Z p - W is the generalization of (JoJ) above to 

the case with iV-body auxiliary integrals G A ' N . 

Within the X-THC representation, the entirely rep- 
resentative generalization of the pairing term, Ai... n = 
(i...n\V\i' ...n')n v ... n >, now scales as 0(M^ D+1 ), 
rather than 0(M^ ND ), with no approximation or restric- 
tion on the form of the local, finite-range potential V. 

It is worth noting that some alternative techniques to 
reduce the cost of treating exchange-like terms involve 
approximating the potential to be direct-product separa- 
ble over N w terms, e.g., by approximating the Coulomb 
operator as a sum of separable Gaussians [T7J |T5] . This 
reduces the cost of forming the generalized pairing tensor 
to 0{M* D+N ). X-THC can be applied to this approx- 
imate w-separable potential, producing an 0(M^ D+1 ) 
implementation. However, it is important to note that 
with X-THC, the w-separable form gives no scaling ad- 
vantage and can only reduce the prefactor and memory 
requirements. Thus, the X-THC formalism largely ob- 
viates the impetus for w-separable approximations. A 
summary of the scaling reductions afforded with various 
factorization approaches and local potentials is shown in 
Table H 

Practical Demonstration - To illustrate the numeri- 
cal equivalence and practical utility of the X-THC ap- 
proach, a hybrid MATLAB/C++ code was developed to 
produce generalized pairing fields for Z?-dimensional, N- 
body forces in Hermite functions. A complete description 
of the code is presented in the supplemental material. 

We have verified that the X-THC generalized pair- 
ing fields are exact within machine precision (as ex- 



TABLE I: Comparison of computational scalings for the pair- 
ing term, using a variety of approaches and types of local 
potentials. M M is the order of the polynomial basis in the uth 
degree of freedom, and the potential is iV-body in D dimen- 
sions. For simplicity, we consider the isotropic case where Mfj, 
is the same in all dimensions in this comparison. iV^ is the 
number of terms retained in a separable approximation to the 
potential. 



Approach 


General Local 


w- Separable Local 


Conventional/DF 
X-THC 


0{Ml NU ) 
0(M? D+1 ) 


0(N w M? D+1 ) 



pected mathematically) . Figure [2] shows the computa- 
tional gains which can be achieved from the X-THC fac- 
torization using a representative example of N = 2 and 
D = 1,2,3. For a general local potential, X-THC is 
several orders of magnitude faster than conventional ap- 
proaches for the largest M M studied here. When the po- 
tential is written in w-separable form, the X-THC scal- 
ing advantage is less dramatic, but X-THC becomes less 
costly for the largest M M used in Figure [2j The X-THC 
approach allows one to retain the general local poten- 
tial and calculate the exact pairing tensor in similar (or 
even less) computational effort as with an approximate 
w-separable potential. 
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FIG. 2: 
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(color online) Wall times for pairing tensor formation 
2 (log-log scale). 



LS-THC - In some cases, more exotic bases than poly- 
nomial types are preferred. For example, atom-centered 
Gaussian functions are often used to solve the electronic 
Schrodinger equation in molecular physics. These basis 



4 



sets are not of direct product form, and are character- 
ized simply by the number of basis functions M , which 
is typically proportional to the number of atoms. 

For non-polynomial choices of the basis functions, 
there does not exist an exact linear-span closure, i.e. one 
cannot choose a set of O(M) functions which spans the 
space formed by the M 2 products of the original basis 
functions. However, the locality of the basis functions 
implies that an approximate linear-span closure might 
be possible. One could simply use the X-THC frame- 
work with a quadrature grid and auxiliary basis, with 
0(M) points and functions, respectively. Preliminary 
investigations indicate this "pseudospectral THC" (PS- 
THC) approximation requires many points to achieve the 
desired accuracy for two-electron Coulomb integrals. 

A straightforward "least-squares THC" (LS-THC) 
modification uses a quadrature grid to form the X opera- 
tor, as suggested above, but then additionally chooses the 
Z operator to minimize the vector 2-norm of the residual 
error in the desired integral tensor. A decomposition of 



the form of ( 15 ) results, where now the tuned Z largely 



corrects for deficiencies in the zeroth-order quadrature 
grid. For an TV-body potential, this method yields the 
analytical formula for Z, 



7P...W 



q-l 

D P'P ■ • 



q-l 

■ °W'W 



X 



p' 



X%{i...n\V\i'...ri), (16) 



where the joint collocation is X?, = XfXf, and the 
physical-space metric matrix is Sp>p = X?,X?,. Wc 
have already demonstrated the accuracy and efficiency 
of LS-THC for the two-body Coulomb potential in quan- 
tum chemistry 0J [5] . 

Summary and Outlook - In this Letter, we have 
demonstrated that a quantized renormalization of the 
coordinate-space form of any local, spatial, finite-range 
A-body potential can be used to produce a powerful ten- 
sor factorization of the configuration-space integrals, a 
form denoted as Tensor HyperContraction or THC. The 
THC approach leads to remarkable scaling reductions. 

When the basis set used is of polynomial direct prod- 
uct type, the X-THC factorization is exact, simple to 
form, and based on the 2M^ + 1-node Gaussian quadra- 
tures related to the orthonormal polynomials in the basis. 
For basis sets which are not of direct-product polynomial 
form, LS-THC can be viewed as a practical approxima- 
tion to X-THC, and is based on determining the approx- 
imate quantized physical-space representation of the po- 
tential by minimizing the 2-norm of the residual in the 
integrals, under the constraint of a preselected colloca- 
tion grid. 

In molecular physics, LS-THC separability of the ERI 
tensor provides an accurate and efficient treatment of 
many difficult terms in correlated methods. In nuclear 
physics, the potential for THC may be even brighter: 



the vast majority of basis sets are of the direct-product 
polynomial form, and can immediately benefit from ap- 
plication of X-THC, using the full span of local A-body 
potentials. 
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PACS numbers: 

NOTATION 

A rather large number of indices appear in the primary 
manuscript, so we summarize them here for clarity. 

Particle Number: The particle number is denoted im- 
plicitly by the presence of ellipses in relevant equa- 
tions. This index ranges from 1 to N. 

Dimension: Each degree of freedom is indexed by /u and 
ranges from 1 to D for each particle. In direct- 
product bases (e.g., X-THC), dimensionality plays 
a major role, as the total number of primary basis 
functions, auxiliary basis functions, and quadra- 
ture grid points scale as O(M^) In non-direct- 
product bases (e.g., LS-THC), dimensionality has 
little meaning [2], so D is usually assumed to be 
1. In a non-direct-product basis, the number of 
primary basis functions, auxiliary basis functions, 
and quadrature grid points all generally scale as 
0{M). 

Primary Basis: The primary single-particle basis is de- 
noted by the indices i to n (bra), and i' to n! (ket). 
In a direct-product basis, i is a composite index 
corresponding to the underlying direct-product of 
1-dimensional primary basis functions, e.g., \i) = 
\ix)\iy)\iz) ■ in a polynomial direct-product basis, 
the 1-dimensional primary basis functions for the 
/zth degree of freedom range from to (the 
zero is a consequence of the polynomial definition 
of the basis). In a non-direct-product basis, the 
basis function indices range from 1 to M. 

Auxiliary Basis: Auxiliary basis indices are denoted by 

the indices A, B, In a direct-product basis, A 

is a composite index corresponding to the underly- 
ing direct-product of 1-dimensional auxiliary basis 
functions, e.g., \A) = \A x )\A y )\A z ) . In a polyno- 
mial direct-product basis, the 1-dimensional auxil- 
iary basis functions in dimension fj, range from to 



2M M (the zero is a consequence of the polynomial 
definition of the basis). In a non-direct-product ba- 
sis, the full auxiliary basis functions range from 1 
to Ma, where increasing Ma increases the fidelity 
of the approximate density fitting procedure. 

Quadrature Grid: Quadrature grid indices are de- 
noted by the indices P,Q, In a direct-product 

basis, P is a composite index corresponding to the 
union of the underlying one-dimensional quadra- 
tures, e.g., r P = (x Px ,y Py ,zp z ). In a polyno- 
mial direct-product basis, the indices for an ex- 
act quadrature for the /uth degree of freedom range 
from to 2M M (the zero is a consequence of the 
polynomial definition of the basis). In a non-direct- 
product basis, the LS-THC quadrature grid con- 
tains Mp points, where increasing Mp increases 
the fidelity of the approximate LS-THC procedure. 

For cases where two classes of indices are required to re- 
solve a tensor element, a double subscript is used, e.g., 
the P-th grid point for the /zth degree of freedom is de- 
noted rp M . Also note that we use the generalized Einstein 
convention in this work: a repeated index on the right 
side of an equation is contracted over if it appears twice, 
or hypercontracted over if it appears more than twice, so 
long as the same index is not present on the left side of 
the same equation. 



FULL iV-BODY D-DIMENSIONAL X-THC 

For clarity, we provide explicit definition of the full 
TV-body D-dimensional X-THC representation here. 

A direct-product basis founded on polynomials has the 
form, 

D 

Mr)= n P *»fM- (!) 
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In each dimension ji, Pi is a polynomial of up to degree 
ifj_ , and is a weight function, typically chosen to bring 
the polynomial into £2(0), where D is the domain of the 
problem, and also to provide qualitative conformation 
to some a priori knowledge of the future shape of the 
wavefunction. The i M index ranges from to M M , so the 
polynomials range up to a maximum degree of M„. Note 
that these polynomials do not have to be orthogonal, 
though they are often defined to be so. The product 
of polynomials being itself a polynomial of up to order 
2M M , the local product ip* {r^ijiv (r^) in dimension {m 
lies inside the span of the 2M M + 1 auxiliary functions, 

XAM=P Afl (r^\v^)\ 2 , (2) 

where Pa^ is a polynomial of up to order 2M M , often 
different from the primary basis polynomials Pi^ . We 
will choose these auxiliary functions to be orthonormal 
for convenience (this avoids DF metric matrices). Note 
that for a (generally complex) polynomial-based primary 
basis {ipi (r^)}, we can always choose an exact, wholly 
real, orthonormal auxiliary basis {XAa{ r n)} with 2M^ + 
1 functions. Thus, most of the complex conjugates on 
the auxiliary basis functions below are included only for 
convenience in the case that this work should need to be 
generalized to complex auxiliary bases. 

The auxiliary functions xa„ (?>) yield an exact DF rep- 
resentation for this basis, 

(i . . . n\V\i' ...«') = [ii'A] . . . [nn'N]G A - N . (3) 

The overlap integrals are separable in coordinates, 

D 

[ii'A] = JJ[V^1- ( 4 ) 

However, in general, the auxiliary potential integrals are 
not separable, 

G A - N = Jdn ... j dr N 

XA{ri)---X*N{rN)V(n,...,r N ). (5) 

To produce the THC representation, it remains to find 
an exact quadrature for the three-index overlap integrals. 
Owing to the closure in the product i^i^, any quadrature 
which can exactly integrate the auxiliary overlap metric 
[B^Afj] = Sb„a„ can exactly integrate the three-index 
overlap integrals [i^i^A^], as the spans are identical in 
both cases. A quadrature which can exactly integrate 
all quadratic products of functions based on orthogonal 
polynomials of up to degree 2M M is precisely the defini- 
tion of the 2M M + 1-node Gaussian quadrature. Regard- 
less of the choice of weight f M (r p ), the nodes and roots of 
this Gaussian quadrature can always be determined effi- 
ciently by the Golub-Welsch algorithm, giving the nodes 



and weights {< r P ^,wp^ >} The overlap integral is then 
exactly resolved as, 

[Vji M = ^IJSP^ ^ ( r P„ ) XAjrp^wp^ . (6) 

Note that the contraction index P^ occurs not twice (as in 
a standard contraction operation) but three times (what 
we refer to as a "hypercontraction"). 

In the full direct-product basis, we will use the col- 
lapsed notation for the collocations of this Gaussian 
quadrature, 

^l— 1 M— 1 

to save space. However, in real implementation, the 
direct-product separability of these two quantities is very 
important to achieve near-optimal scaling in formation of 
the X-THC factorization and subsequent utilization. 
With these definitions, the X-THC Z operator reads 

Z P-W = yP Y W Q A...N^ (8) 

which yields the the full iV-body D-dimensional X-THC, 

(i . . . n\V\i' . . . n) = Xf . . . X^Z p - w Xf, . . . X% . 

(9) 

The Z operator is not, in general, direct-product sepa- 
rable, but the factors X and Y arc. In forming Z, the 
contraction of the Y factors with the auxiliary poten- 
tial integrals would naively scale as 0(M^ D+D ), bearing 
in mind that the number of auxiliary functions and the 
number of quadrature points are both proportional to 
0(Mj?). However, for each Y, we can perform the trans- 
formation in one dimension at a time (e.g., replacing A x 
with P x , etc), reducing the formal scaling to 0{M^ D+1 ). 
Similarly, the separability of the X and Y factors is crit- 
ically important to reduce the scaling of the generalized 
pairing term, 

Aj...„ = (i . . . n\V\i' . . . n')Ki>... n >. (10) 

This contraction of a rank- iVD tensor with the integral 
tensor is in fact the worst possible scenario, as far as 
the DF representation is concerned, since the compound 
contraction index involves all N DF coefficient tensors, 

*i... n = d&...d» n ,G A " N (11) 

In general, this term will always scale as 0(M 2ND ) with 
both conventional and DF approaches, though the com- 
putational pre-factor is markedly higher in the latter 
case. When using X-THC, the generalized pairing term 
now reads, 

Aj...„ = (i . . . n\V\i' . . . n')Ki>... n > 
= Xf . . . [X* n w [z p - w [X P . . . [X% «,...„,]]]] . (12) 
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Any contraction involving X here would naively scale 
as 0(M^ D+D ). However, for each X, we can perform 
the transformation in one dimension at a time (e.g., re- 
placing i' x with P x , etc), reducing the formal scaling to 

A common technique (usually an approximation) is to 
assert a w-separable form for the potential, 

N„ D 

V(r u , r N ) = g II > ' • ' < W)- (13) 

In this case, the integral tensor factors as, 

N w D 

W — l (X=l 

(14) 

This allows the pairing tensor to be computed in 
0{N w M* D+N ) via several intermediates. For X-THC, 
a ^-separable potential allows for separability of the Z 
operator, i.e., 

N w D 

(i...n\V\i'...n>) = J2l[ 

W — l f_l — l 

This allows the pairing tensor to be computed in 
0{N w M^ D+1 ). Note that this is the same formal scaling 
as X-THC in a general local potential, but the memory 
usage for Z is lower, and the prcfactor may or may not 
be lower, depending on how large N w is. 

DEMONSTRATION CODE 
Overview 

To support the mathematical demonstration of exact 
X-THC resolution of the potential integral tensor, and 
to provide a practical example of the scaling gains pro- 
vided by X-THC, a mixed MATLAB/C++ code was de- 
veloped for the case of Z?-dimensional Hcrmite functions 
with A-body w-contracted Gaussian forces. In this code, 
the required potential integrals (in the primary or aux- 
iliary basis) and possible X-THC factors are generated 
in MATLAB and written to disk, for the given basis size 
Mfj_, dimensionality D, number of bodies N, and number 
of w-contraction points N w . For each integral technology 
and u;-separable vs. w-nonseparable case, the generalized 
pairing field is computed for a randomly generated pair- 
ing tensor in a standalone C++ code for the particular in- 
separability and integrals technology case. In each C++ 
code, the integrals and factors are read in, the w indices 
contracted over first if simulating a non-separable force, 
and then the generalized pairing tensor is computed ac- 
cording to the algorithms discussed below. 



MATLAB was chosen for the integral and factor gen- 
eration routines due to ease of implementation, and par- 
ticular strength in treatment of arbitrary rank tensors. 
As this portion of the total procedure would typically be 
performed as a single-use overhead step (e.g., before HFB 
iterations or before application of the integrals in corre- 
lated methods), we do not include this step in the tim- 
ings for the generalized pairing tensor, and thus there is 
no penalty for using the interpreted and rather memory- 
nai've MATLAB language for this stage. For the heavy 
linear algebra work of pairing tensor formation, C++ was 
selected for its "close to the metal" properties, particu- 
larly including explicit control of memory allocation and 
ability to swap pointers without performing explicit deep 
copy operations. Wherever possible (and for absolutely 
all contraction operations), BLAS calls are used, with 
the algorithms designed so as to allow for permutation of 
memory to be hidden in BLAS3 or BLAS2 operations as 
much as possible. The algorithms were formulated so as 
to rely preferentially on the BLAS3 DGEMM operation, 
followed by the BLAS2 DGEMV, followed by various 
BLAS1 operations. Every effort was expended to pro- 
duce well-optimized algorithms, with the same amount 
of optimization present in both the X-THC and conven- 
tional methods. With these considerations, we believe 
that the timings reported are entirely representative of 
a practical application of the various integrals technolo- 
gies, and show a wholly fair comparison of X-THC and 
conventional methods. 

The C++ codes are compiled with the Intel iepe 
12.0.1 compiler, using -03 optimization. The BLAS 
calls are handled with Intel's very efficient MKL 7.0.1 
library, with threading disabled. Performance measure- 
ments are performed using the PAP I 5.0.1 library, which 
features a wall timer accurate to ~ 1 microsecond. Ac- 
cumulated averaging was performed to ensure that all 
pairing kernels ran for at least 1 second of wall time, 
which ameliorates startup and noise costs for small prob- 
lem sizes. All timings were produced on a single-socket 
node featuring a quad-core 3.4 GHz Intel i7 Processor 
(Sandy Bridge) with 32 GB of DDR3 and 8 MB of L3 
cache. All timings are for wall times using a single thread. 

Note that we do not show results for density-fitted al- 
gorithms in this study, though we have coded generalized 
pairing routines using this approximation. All numeri- 
cal results are essentially the same for conventional and 
X-THC approaches (e.g., numerically exact to within a 
small pre- factor of the machine epsilon) . For w-separable 
forces with N > 1, the optimal pathway is to contract 
the DF intermediates to the conventional integrals, and 
then form the pairing tensor in the conventional man- 
ner. As a result, the DF timings results for w-separablc 
potentials are essentially indistinguishable from the con- 
ventional case. For non-w-separable potentials, the DF 
algorithms require the same 0(M^ ND ) scaling as the con- 
ventional case, but the pre- factors are many orders of 
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magnitude larger, due to the larger auxiliary basis sizes 
involved. Moreover, in the non-w-separable case, forma- 
tion of the conventional integrals from the DF integrals 
exhibits a higher scaling of 0(M 2ND+1 ), and is therefore 
not a viable alternative. In any case, the conventional 
algorithm always outperforms the DF algorithm for the 
generalized pairing tensor, so we have elected to not in- 
clude the DF results here. This failure of DF methods 
for "exchange-like" contractions is well known, and was 
the primary motivation for the development of the THC 
representation. 

Basis/Potential Choice 

Hermite Function Primary Basis 

The primary basis functions chosen for this demonstra- 
tion are direct products of generalized Hermite functions, 

D 

fj- 

Below, we will drop the n labels and work in ID (r = 
x) unless otherwise noted. The ID generalized Hermite 
function is, 

^(x) = {b^ 2 (V^m-^H^z) exp(-z 2 /2), (17) 

S v ' 

where Hi is the i-th Hermite polynomial and the nondi- 
mensional coordinate is, 

z = b^x. (18) 

i runs from to M p . Note that we reserve (f>i(z) for the 
true non-dimensional Hermite function, where = 1 . 

The auxiliary functions for this problem are also gen- 
eralized Hermite functions, 

X a(x) = (y26 M ) 1 /2(^A!)- 1 /2 jffA ( z ' ) exp(-z' 2 /2), 

(19) 

where now, 

z' = V^b^x, (20) 

and A runs from to 2-M^. The THC quadrature for 
this problem is thus the 2M^ + 1-node Gauss-Hermite 
quadrature with a spatial length scale of V^b^. 

Gaussian Potential 

For flexibility, we use for the potential a linear combi- 
nation of Gaussians. Given the set {< a w ,fi w >}, the 
form of the potential is, 

N m N-l N 

V(xi,...,x N ) = X] a ™ D cxp(-/3 w x 2 ^). 

w rj=l £=r/+l 

(21) 



For a 2-body potential in 1 dimension, with N w = 1 this 
reduces to the usual Gaussian force, 

V(x u x 2 ) = a 1 J D C M-P w x\ 2 ). (22) 

For higher N, this is simply a sum of two-body Gaus- 
sian forces over all possible pairs of two-body coordinates, 

We use a set of {< a w ,/3 w >} with N w = 8 which 
approximates the Coulomb operator 1 / r\ 2 for all compu- 
tations shown in this work. This allows us to show sepa- 
rate accuracy and timings results for the w-separable and 
non-tc-separable cases, depending on whether we choose 
to sum over the w index first or last. 



Potential Integrals (MATLAB) 

All of the ZJ-dimensional iV-body potential integrals 
above can be constructed from primitive 1-dimensional 
2-body potential integrals. For conventional integral ap- 
proaches, four-index integrals in the primary basis are 
required. Within X-THC, two-center integrals in the 
auxiliary basis are required. Integrals of this type are 
discussed extensively in [1], where Moshinski transforma- 
tions and length-scale transformations are used to pro- 
vide analytical conventional integrals (the generalization 
to auxiliary integrals is straightforward). 



Conventional Integrals 

Following [1], the 1-dimcnsional 2-body primary inte- 
grals are computed as, 

(ij\V w \i'f) = a]/° jj d Xl dx 2 MxiWjfa)* 

cxp(-/3 tu a; 2 2 )V'i'(xi)V'i'(x2) 
= ali D D^ n D^ n ,D aal {r ha )D bal {r ha ), (23) 

where all summations run from to 2M p . D^ n are the 
Moshinski transformation coefficients, and D aa i(r] w ) are 
the length-scale transformation coefficients (see below). 
The length-scale change parameter is 

Vw = , 1 (24) 

yjl + 2p w /bl 
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Auxiliary Integrals 

By extension, the 1-dimensional 2-body auxiliary inte- 
grals are computed as, 



G 



AB 



II dxidx2 X a ( Xi ) cx p(~P wX i2)Xb(x2) 
--D*&D nn ,(rh,)I N I n ., (25) 



V2b^ + 

where again all summations run from to 2M M . The 
In quantities are primitive integrals over single Hcrmitc 
functions of unit length, 



IN 



I 



dz ip N (z) 



(N-iy.< AT 

- — : — - — /V even 

even (26) 
N odd V ' 



Moshinski Transformation Coefficients D 



Aa 
mn 



The Moshinski transformation coefficients relate prod- 
ucts of Hermite functions in Eulerian coordinates x\ and 
x 2 to corresponding Hermite functions in the Lagrangian 
coordinates X and x, where, 

X=-^[x 1 +x 2 ], x=-^[x 1 -x 2 ], (27) 

The transformation is 

Vv(*i)VU^) - D»;l^ N {X)^ n {x). 



(28) 



As the Gaussian potential is central (i.e., depends only 
on x, not X), invoking the Moshinksi transformation re- 
duces the two-coordinate potential integrals to a sepa- 
rable product of one-coordinate integrals in X and x. 
If ni and n 2 each range from to M^, N and n each 
range from to 2M M . The well-known selection rule is 
ni + n 2 = N + n. A simple, explicit formula for these 
coefficients is [1] 



ly ni,n 2 ~~ °n 1 +n 2 ,N+n 



ni\n 2 \ 
N\n\ 



V2) 



N+n 



i<N ',j<n,i-\-j=ni 

£ 

i,j=0 



(-iy. (29) 



However, this formula is unstable for large values of 
N + n due to the summation over alternating quantities 
which are both large. We have therefore derived stable 
recurrence relations for the Moshinksi coefficients. The 
recurrence relation in the first coordinate is, 



D 



N,n 

n 1 +l,n 2 



1 



m + i nun2 



lN,n-l 



D 

m + 1 " 1>na 



(30) 



And the corresponding recurrence relation in the other 
coordinate is, 



D 



N,n 

™1 ,"2 + 1 



1 

71 



n 2 + l ni ' n2 



~,N-l,n 



11 p)N,n-l 

n 2 + l 111 '™ 2 ' 



where 



D o,o _ , 



(31) 
(32) 



For implementation purposes, any Moshinski transforma- 
tion coefficient with a "negative" index can be taken to 
be zero. 



Length-Scale Transformation Coefficients D nn i(r) w ) 

The length-scale transformation coefficients provide 
the correspondence between Hermite polynomials of dif- 
ferent length-scales, 

H n (z) = {r])- l ' 2 D nnl {n)H n ,{z' = z/n) (33) 

Note that the transformation automatically renormalizcs 
the polynomials, n and n' both run from to M M . 
A closed-form expression of these coefficients is [1] 



D nn ,{rj) = F„ >n /(-l) 2 



n!\ 1/J f|"'+ 1 / J (l-^) s ? 



n' 1 



(34) 

on the condition that n > n', and that the parity agrees, 

/w-in+f-ir-'i 3? m 

Because f] < 1, this formula appears to be univer- 
sally stable. However, the numerators and denomina- 
tors above both involve divergent values, so a log-space 
formalism is required for explicit evaluation to prevent 
overflow. This may lose a few digits of precision, so we 
have elected to use a recurrence relation for these coef- 
ficients, which is easily derived. The recurrence relation 
is, 



Ai+l,n' = V 



where 



in'- 









n+ 1 



n + 1 



n.n' — 1 



■jfln-iy- (36) 



(37) 



For implementation purposes, any length-scale transfor- 
mation coefficient with a "negative" index can be taken 
to be zero. 
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D -Dimensional, N-Body Integrals 

The generalization to TV-body integrals as described 
above is carried out at the 1-dimensional stage (before 
the product over \x is carried out), e.g., for primary-basis 
integrals in the 3-body case, 

(ijk\V w \i'j'k') = (ij\V w \j'j')5 kkl 

+ (ik\V w \i'k')S jr + (jk^lfk'jSu,. (38) 

For auxiliary basis integrals, there is no delta function 
in the third coordinate, but rather a normalized primi- 
tive Hermite integral, i.e., the quantity In defined in the 
auxiliary potential integrals. Thus, the auxiliary poten- 
tial integral is, 

G ABC = G^ B Ic + G^ c Ib + G bc Ia- (39) 

In practice, the generalization to 3-body integrals is per- 
formed in MATLAB on the 1-dimensional integrals, be- 
fore integrals are written to disk. 

The generalization to non-w-separable D-dimensional 
potentials is carried out by summing over w, e.g., in the 
2-dimensional, 2-body case, 

(ij\V\i'f) = ^(iMV w \i'JJ(i y j y \V w \i' y j' y ), (40) 

w 

or, 

g ab = j2 G ^ B=cG ™ yBy - ( 41 ) 

w 

In practice, the production of the nonseparable Z fac- 
tors or conventional integrals is carried out in blocks in 
C++, to save memory. The formation of these integrals 
is not counted in the pairing timings, as infinite memory 
is assumed. The overhead for this is many orders of mag- 
nitude larger for the conventional case than the X-THC 
case, due to the larger size of the rank-2A^D conventional 
integral tensor. 

X-THC Factors (MATLAB) 

To complete the X-THC factorization, the Gauss- 
Hermite quadrature nodes/weights and collocation ma- 
trices X and Y are required. 

Quadratures 

The THC grid for this problem is the 2M M + 1 node 
Gauss-Hermite quadrature with the spatial range param- 
eter V^b^. 



First, the non-dimensional quadrature is generated. 
The orthonormal-basis position operator Apq = (P\x\Q) 
in the auxiliary Hermite functions is, 



Xpq = (P\x\Q) = V —j- (tp+hQ + Sp,q+i) ■ (42) 

This operator is symmetric tridiagonal, with zero diago- 
nal. The eigendecomposition is formed, 

Xpq = Qpp'Xp'Qqp' . (43) 

The eigenvalues are the nondimcnsional quadrature 
nodes. The weights are determined by first generating 
the moment vector, 

v P = ^Spo, (44) 

and applying the diagonalizing transformation, 

v P i = Qpp>vp. (45) 

The full weights are then given by 

wp> — vp, exp(xp,). (46) 

The nodes and weights are then transformed to the prob- 
lem domain, by 

xp> Wp> 
x P = — — — , to? = — — . (47) 

Collocation 

The X and Y X-THC factors require collocations at 
the quadrature nodes. This is accomplished by efficient, 
stable recurrence relations for the Hermite functions. 

The non-dimensional coordinate is first computed, 

z = b^x. (48) 

The first two non-dimensional Hermite functions are 
computed explicitly, 

Mz) -7r- 1 / 4 exp(-z 2 /2), (49) 

and 

V>i(z) = V2z7r- 1/4 exp(-z 2 /2) = V2zip (z). (50) 
The recurrence relation is then applied iteratively, 



And finally the length-scale normalization is added, 

Uz) = (b^/^iz). (52) 
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The X factor is, 

Xf = ^(xp) = (K) 1/2 A(z = b^ P ), (53) 
and the Y factor is, 

Ya = w pX a(xp) = wpiV^b^/^jAiz' = V2b^x P ). 

(54) 

If desired, the DF 3-index overlap integrals can be gen- 
erated exactly (within numerical precision) from the X- 
THC factors, 

[ii'A] = XfX.?Y£. (55) 

The X-THC Z operators are immediately formed from 
the Y factors and corresponding G auxiliary potential 
integrals, 

Z?=YZY§G*». (56) 

Generalized Pairing Algorithms (C-\ — I-/BLAS) 

Algorithms for the generalized pairing tensor with con- 
ventional or X-THC integrals, and general or w-separable 
potentials, arc described below and depicted in Algo- 
rithms 1-4. In these algorithms, we use ellipses to denote 
arbitrary rank in D or TV, and we follow the convention 
that dimensions are striped as the slow superindex, and 
particles are striped as the fast superindex [3]. The ten- 
sors used in these algorithms are stored in practice as a 
collapsed single-dimensional array (a double*), regularly 
striped so that the left-most index is the fastest dimen- 
sion and the right-most index is the slowest dimension 
(Fortran order, which allows for convenient application 
of BLAS operations). For key contraction operations, 
we group sets of neighboring indices to form the three 
superindices i (result row or fast index), j (result col- 
umn or slow index) and k (contraction index) for use in 
GEMM, 

<",, A ik B kj . (57) 

Note the use of color to emphasize the choice of su- 
perindices. By changing the order of A and B and al- 
tering the GEMM transposition arguments, we can take 
z, j, and k in any order in the factor tensors A or B 
(e.g., the contraction index could actually be the row di- 
mension in A). However, striding is not permitted with 
BLAS3 operations, e.g., a tensor contraction of the form, 

C ijk = A u B jlk . (58) 

would require explicit transposition to Bij^ or Bj k i to be 
able to group the compound jk index. Our algorithms 
are designed to eliminate such explicit transposition as 
much as possible. However, no transposition-free algo- 
rithm exists for X-THC w-separable potentials in arbi- 
trary N, see the discussion below. 



Note that our discussions and codes all assume 
isotropic basis sets for simplicity (e.g., M x = M y ), but 
X-THC is certainly not restricted to this. In general, X- 
THC can handle differing values of M x and M y , and even 
different classes of basis functions in each dimension (e.g. 
cylindrical coordinates). 



Conventional Integrals, General Potential 

See Algorithm 1. This algorithm is quite simple, but 
extraordinarily expensive (this is why non-w-separable 
forces are rarely used in high D or N cases). The physi- 
cists' integral tensor is used in a simple matrix-vector 
product with the generalized pairing tensor, which is car- 
ried out by GEMV in M^ ND = 0{Ml ND ) operations. 

In practice, such an algorithm will almost certainly be 
orders of magnitude more expensive than reported here. 
In our demonstration, we simulate infinite memory by 
forming blocks of the integrals prior to GEMM (i.e., an 
integral direct procedure). The integral generation costs 
0(M^ ND ) (in this case by contracting out the w index) 
with a much larger prefactor than the GEMV itself. In 
a fully generic potential, the integrals would have to be 
generated explicitly, at possibly even higher cost. This 
integral formation is not counted in the wall times re- 
ported here, but would increase the practical wall time 
considerably in the usual case that the rank-2A r Z? inte- 
gral tensor does not fit in core memory. 



X-THC Integrals, General Potential 

See Algorithm 2. This algorithm works by cycli- 
cally transforming from configuration space i' to the 
quantized coordinate space P^ by GEMM, scaling the 
coordinate-space pairing tensor by Z , and then perform- 
ing another cyclic transformation to bring P^ back to i„ 
via GEMM. The cyclic transformations are written so 
that the fast index in the current buffer is contracted off, 
and the replacement index is placed at the slow index of 
the result buffer. The pointers for the result and current 
buffer are then swapped, and the next contraction index 
automatically appears in the fast index of the current 
buffer, without any need for explicit transposition. 

Formally, 0(M^ D+1 ) operations are required for the 
cyclic permutations. The prefactor arises from the ~ 
size (per dimension) of the quadrature index, and 
from the fact that two cyclic permutations are required. 

This algorithm currently requires essentially three 
buffers of size 2 ND M^ D (T, U, and Z). An efficient 
blocked or disk-based algorithm could be developed if this 
becomes a bottleneck, with considerable performance 
gains expected over conventional disk-based or integral- 
direct algorithms with general potentials. 
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Conventional Integrals, w-Separable Potential 

See Algorithm 3. This algorithm works by cyclically 
applying the integrals for dimension /j, via GEMM, for 
each w point. Explicit transposition is avoided by plac- 
ing the replacement index (i . . . n)^ as the slow index of 
the result, allowing (i' . . . n')n+i to be exposed as the 
new fast index. The formal scaling of this algorithm is 
0(N w M^ D+N ) operations, with remarkably small exter- 
nal overhead. The memory requirement is essentially 2 
M^ D buffers (T and U). The small overhead, combined 
with the efficiency of the long (i' . . . n')^ contraction in- 
dex and small memory footprint explains much of the 
current success of w-separable potentials. 

X-THC Integrals, w-Separable Potential 

See Algorithm 4. This algorithm works by, for each w 
point and dimension fi, cyclically forward transforming 
from i' to P^, applying the Z^f "' W ^ operation, and 
then cyclically backtransforming from P^ to i^. The 
formal scaling is 0(N w M^ D+1 ) FMA (fused multiply- 
add) operations. The required memory is essentially two 
2 n m nd buffcrs T and c/j a factor of 2 N more than the 

conventional w-separable algorithm. 

Unfortunately, an explicit transposition is required for 
this algorithm for general N. The genesis of this require- 
ment is that the cyclic permutation is not carried through 
all ND coordinates, but only N coordinates at a time. 

In the 2-body specialized algorithm can be ap- 

plied to avoid the explicit transposition. The transfor- 
mation in each dimension reads, 

Uj'LP <— Ti'jiLXf 

Uqlp Tj/ LP X^ 
Uqlp 4- T QLP Z^ Q 
Uqli <- Tq LP X[ 

(59) 

Here L is the compound index corresponding to the other 
dimensions, and pointer swap is assumed between each 
step. 

One additional modification can attenuate the prefac- 
tor somewhat, at the cost of doubling the buffer space. 
The forward transformation of the first dimension from 
K (i'j')iL to Ulipq) 1 is the same for all w points, as k 
is u>-independent. Moreover, the back transformation of 
the last dimension from Ul(pq) d to A L (ij) D does not de- 
pend on the individual w points, but only on their sum. 
Thus the buffcrs Ul(pq) 1 an d Ul(pq) d can be used in a 



prelude/epilogue construct. In the limit that N w — > oo, 
the savings are 100%, 50%, and 33%, for D = 1, 2,, and 
3, respectively. In the limit that N w — 1 or D — > oo, the 
savings approach 0%, so the deployment of this modifi- 
cation would depend greatly on the context and memory 
capacity. 

Computational Results 

Timings and accuracy results for generalized pairing 
tensor formation are shown in Figures 1 and 2, respec- 
tively, for various integral technologies, dimensions, num- 
ber of bodies, and problem sizes. To help crystallize the 
information in the timing data, asymptotic scaling and 
predicted crossover metrics are presented in Tables I and 
II, respectively. 

The accuracy results of Figure 2 depict the relative 
maximum residual R in the pairing tensor, defined as, 



II A Method A Reference | 
|| i...n i...n 


1 CO 




| A Reference | 
| i...n 


1 oo 



(60) 



Here the w-separable conventional integral technology 
was selected as a reference for the accuracy [4]. It is 
apparent that all methods (conventional and THC) are 
exact to within a reasonable growth factor against the 
machine cpsilon. In fact, the worst relative maximum 
residual seen here is less than 10~ 12 (compared to the 
double-precision machine epsilon of 2.2x 10 -16 ), and does 
not seem to be grow markedly with respect to problem 
size. This provides strong numerical evidence that X- 
THC is a lossless compression of the potential integral 
tensor. Further agreement could doubtless be obtained 
if the Moshinksi coefficients were evaluated with higher 
precision [5]. 

The timing results of Figure 1 are quite clean as tim- 
ings go, with very smooth increase with respect to prob- 
lem size, indicating that the accumulation procedure 
(smaller M M ) or sheer size of the problem (larger M^) 
are sufficient to eliminate the bulk of the noise that of- 
ten plagues timing studies. On a log-log scale, all timing 
curves exhibit a slight upward concavity which quickly 
tends toward linearity for larger M M as the rate-limiting 
DGEMM-based steps become dominant relative to the 
lower-scaling operations. Two noticeable "jumps" exist: 
the last few points of the 2-body, 1-dimcnsional conven- 
tional separable case and the last few points of the 2- 
body, 3-dimensional THC separable case. These jumps 
are likely due to working set size limits exceeding some 
discrete performance threshold in hardware, for instance, 
cache or memory bank limits, respectively. From these 
plots, power-law regression of the form, 

tpairing = O^, (61) 

is performed, using points selected above an M M which 
appears to be visually free of noise in each case. The j3 
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from these regressions are depicted in Table I. The pre- 
dicted THC crossover points from these regressions (the 
critical M M at which X-THC becomes practically superior 
to conventional integrals) are shown in Table II. For all 
cases in which an explicit crossover occurs, the predicted 
crossover point from the power-law model is within 1 M M 
of the observed value. Note that the observed asymptotic 
scaling is often less than the theoretical value. There are 
two geneses for this: residual contributions from lower 
scaling operations (which drags the scaling down at the 
cost of prefactor) and better GEMM/GEM V efficiency 
for larger matrix sizes (which makes GEMM/GEMV ap- 
pear to scale better than cubic/quadratic as the matrix 
size increases). However, the relative scaling relation- 
ships between all integral technologies is retained. 

From this point forward, it is useful to consider general 
and w-separable potentials separately. 

For all cases of N and D in general local potentials, 
X-THC is markedly more efficient than conventional ap- 
proaches. In all such cases, X-THC crosses over conven- 
tional (often at very small M M for D > 1), and is several 
orders of magnitude faster for the largest cases shown 
with D > 1. This is strong evidence for the immedi- 
ate application of X-THC to problems involving general 
potentials. 

For w-separable cases, X-THC always exhibits lower 
asymptotic scaling than conventional, and provides 
crossovers and sometimes significant speedups for D = 1 
and D = 2. However, the narrower asymptotic sepa- 
ration between conventional and X-THC (/3conv-Sc P — 

^THC-Scp = N-l VS. /3conv-Gcn-/?THC-Gcn = ND-1) 

exposes the prefactor of X-THC, particularly for larger 
D. This prefactor has two geneses: the formal FMA 
prefactor due to successive substitution of primary ba- 
sis indices for larger quadrature indices and the lower 
efficiency of X-THC DGEMM operations compared to 
conventional DGEMM operations. For a 3-body, 3- 
dimensional w-separable potential, the crossover seems 
likely to occur just outside of the memory-limited prob- 
lem size explicitly shown in Figure 1. In fact, the pre- 
dicted crossover point for this case is M M = 10.3. The 
2-body, 3-dimcnsional w-separable potential is somewhat 
more sinister: a "jump" in the timings curve caused by 
some hardware boundary causes the expected crossover 
point to increase to a practically unreachable value of 
= 1789. 

The inability of X-THC to provide a practical crossover 
for the 2-body, 3-dimensional w-separable potential is an 
indication that this technique is not a panacea. How- 
ever, we point out that the utilization of a w-separable 
potential throughout the literature seems to stem from 
the lack of an X-THC representation for a general poten- 
tial: the approximation of the potential as w-separable 
was required to provide a tractable numerical recipe. In 
this light, X-THC treatment of general potentials pro- 
vides a practical alternative pathway that avoids approx- 



imation of the potential as w-separable. For all of the 
cases studied here, the general X-THC curve is within 
roughly an order of magnitude of the conventional sep- 
arable curve (in some cases even faster!). Moreover, X- 
THC does provide some gains in w-separable potentials, 
particularly for larger TV. On more formal grounds, the 
demonstrated lossless asymptotic reduction of a general- 
ized pairing term in any local potential from 0(M^ ND ) to 
0(M^ D+1 ) is a useful result in and of itself, considering 
that simply storing the pairing tensor requires 0{M^ D ). 



* Electronic address: schunckl@llnl.gov 

t Electronic address: sherrill@gatech.edu 

* Electronic address: toddjmartinez@gmail.com 
[1] L. M. Robledo, Phys. Rev. C 81, 044312 (2010). 

[2] An example is molecular physics in an atom-centered ba- 
sis: the number of basis functions is always proportional 
to the number of particles in the system, and has no de- 
pendency on the 1-, 2-, or 3-dimensional nature of the 
molecular geometry. 

[3] By "superindex" we indicate a generalized index which is 
composed as a concatenation of several quantities which 
would normally be considered to be indices in their own 
right 

[4] Note that there is some ambiguity here, as some round- 
off error is intrinsic to the w-separable conventional refer- 
ence itself. A particularly marked source of roundoff error 
is the Moshinski relations for the potentials. Since both 
the w-separable and general conventional pairing routines 
share the same underlying Moshinski relations, superior 
agreement between these two conventional methods and 
the THC methods does not necessarily imply that one is 
more accurate than the other, against a hypothetical ex- 
act precision result. In any case, the agreement between 
all methods is sufficient that the point is moot. 

[5] This is the most numerically susceptible portion of the 
procedure, even when using recurrence relations. 
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Algorithm 1 Generalized pairing algorithm: conventional integrals, general potential. 

1: procedure PAlRlNG_CONV_GEN(((i . . . n)i ... (i ... n)o|V'| (i . . . n')i ... (i' ... n')o}, «(i'...n')i...(»'...n')r> ) 

2: Allocate A(i... n ) 1 ...(i... n ) D > Target (Typically Preallocated) 

3: A (i ... Tl)l . ..(*..,,;„ = <(i . . .n)i . . - (i . . . n) D \V\(i' . . . n')i ...(%'... n')B>««'...»')i...«'...»') D > GEMV, 0(M^ D ) 

4: return A (l ... n)l ... (i ... n)D 

5: end procedure 



Algorithm 2 Generalized pairing algorithm: X-THC integrals, general potential. 
1: procedure Pairing_THC_Gen(X^, z {p - w)i - {p - w)d , K( i '...„/) 1 ...( i /...n')r,) 

2: Allocate .,. n ')i...U' ...n') D ^ Target (Typically Preallocated) 

3: Allocate Trp...w)i...(P...W)D > Scratch Array (Typically Preallocated) 

4: Allocate U(p...w) 1 ...(p...w) d > Scratch Array (Typically Preallocated) 

5: T(i'... n ') l ...{i'...n') D = «(i'...n')i — (i'-n')D > Deep Copy 

6: for all fi G [1,-D] do > Start Cyclic Permutation in p and rj 

7: for all X] G [1,JV] do 

8: ^...,%...(p...^.(p)p = ^ P T (i0M( ...„ % ... (J .... w)M _ 1 > GEMM, 0(M^ +1 ) 

9: swap(T, U) > Pointer Swap 

10: end for 

11: end for > End Cyclic Permutation in (i and rj 

12: T {p ,.. w)i ... (p ,., w)d * = Z {P-W)i-<,p-W)d > Hadamard Product, 0(M^ D ) 

13: for all fx G [1,-D] do > Start Cyclic Permutation in fi and r\ 

14: for all r? G [1,N] do 

15: ^...w) M ...(i...») M _i(0p = ^ P "2 T (P) M (... W %...(i...„) f ._ 1 > GEMM, 0(M™) 

16: swap(T, (7) > Pointer Swap 

17: end for 

18: end for > End Cyclic Permutation in fi and t] 

19: A( i ... n)l ...( i ...„) D = T(i...„) 1 ...(i...») D > Deep Copy 

20: return A {l ... n)l ... (i ... n)D 
21: end procedure 
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Algorithm 3 Generalized pairing algorithm: conventional integrals, ^-separable potential. 
1: procedure Pair.ing_Conv_Sep((i . . . n\V\i' . . . n')™, 

2: Allocate ^■u.., n ) l ...(i...n) D =0 > Target (Typically Preallocated) 

3: Allocate T , (j'...„/) 1 ...(i'... n ') D > Scratch Array (Typically Preallocated) 

4: Allocate J/(i'... n ')i--(i'---n')D > Scratch Array (Typically Preallocated) 
5: for all w G [1,N W ] do 

6: T$>... n ') 1 ...(i>...n>)B = K (i'-n')i.:(.i'-n') D > Deep Copy 

7: for all /i G [1,-D] do > Start Cyclic Permutation in /j, 

8: ^(i... n ) M _ 1 (i...») M =<<--.n|^-.-n^^'... B %...Ci... B ) M _ 1 > GEMM, 0(N w M^ D+N ) 

9: swap (7™, (7™) > Pointer Swap 

10: end for > End Cyclic Permutation in /i 

11: A(i...„) 1 ...(i... n ) D + = r ( I.. n)l ... (i ... n)D > Contribution from w 
12: end for 

13: return A {i ... n)l ... (i ... n ) D 
14: end procedure 



Algorithm 4 Generalized pairing algorithm: X-THC integrals, w-separable potential. 

P ( P W) 

1: procedure PAlRlNG_THC_SEP(A i M , 

2: Allocate A^/..^/^. ..u'... n ') D > Target (Typically Preallocated) 

3: Allocate T(p...vK)i...(i...n)D ^ Scratch Array (Typically Preallocated) 

4: Allocate f/(_p...wn 1 ...(i...n)D > Scratch Array (Typically Preallocated) 

5: for all w G [1,N W ] do 

6: 2V...n')i...(i'...n') D ~ K (»'...n')i-.-(i'-.n')D > Deep Copy 

7: for all /i G [1, -D] do > Start Cyclic Permutation in /j, 
8: for all 77 G [1,7V] do 

9: !7(...n % ...( l ...n Wl (P) M = X^T^^...^,..^^ > GEMM, 0(/V I1) M M ArD+1 ) 

10: swap(T, (7) > Pointer Swap 
11: end for 

12: r... (1 .„ n) „_ l(P ...wr) * = Zi p - w) » > SCAL, 0(N w M^ D ) 
13: for all rj G [TV, 1] do 

14: (7(„),...( l ... n)M _ 1 (P...) (1 = X„ f ,»Vr ... ft) „_ l(P ... M w V > GEMM, 0(/V ro Af^+ 1 ) 

15: swap(T, (7) > Pointer Swap 
16: end for 

17: C...(i...n)„_i(i...n) (1 = 7 , (i...n) M ...(i...n) A _i > Explicit Transposition 

18: swap(T, (7) > Pointer Swap 

19: end for > End Cyclic Permutation in /j, 

20: A (i ... n ) 1 ...( i ... n ) D + = T( l ...„) 1 ...( i ...„) D > Contribution from w 

21: end for 

22: return A (l ... n)l ... (i ...„ )D 
23: end procedure 
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FIG. 1: Timings for generalized pairing tensor formation vs. problem size for various integral technologies, numbers of bodies 
N, and number of dimensions D. 
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FIG. 2: Relative maximum residual for generalized pairing tensor formation vs. problem size for various integral technologies, 
numbers of bodies N, and number of dimensions D. 
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TABLE I: Asymptotic scalings of generalized pairing tensor formation with various integral technologies. Shown are observed 
scalings based on power-law fit to timing data and corresponding theoretical scaling (in parentheses). 



D N /3conv-Scp /^THC-Scp /^Cony-Gen /^THC — Gen 

12 4.5 ( 4) 1.9 ( 3) 3.9 ( 4) 1.8 ( 3) 

13 5.8 ( 6) 3.0 ( 4) 5.5 ( 6) 3.2 ( 4) 
2 2 5.3 ( 6) 4.4 ( 5) 7.3 ( 8) 4.6 ( 5) 

2 3 8.3 ( 9) 6.1 ( 7) 10.5 (12) 5.5 ( 7) 

3 2 7.2 ( 8) 7.0 ( 7) 10.3 (12) 5.8 ( 7) 
3 3 8.9 (12) 7.3 (10) 13.7 (18) 7.0 (10) 



TABLE II: Predicted THC vs. conventional crossover points for generalized pairing tensor formation based on power-law fit to 
timing data. 



D N M^ cn Mf^ 

1 2 1.224E+01 1.172E+01 

1 3 3.429E+00 5.170E+00 

2 2 2.194E+00 1.657E+01 

2 3 2.304E+00 7.928E+00 

3 2 2.024E+00 1.789E+03 
3 3 1.650E+00 1.035E+01 
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